```{python}
# Imported needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.collections import LineCollection
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
import seaborn as sns
color = sns.color_palette()
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import MiniBatchKMeans, KMeans, DBSCAN
from sklearn.neighbors import NearestNeighbors
import kneed
from sklearn.decomposition import PCA
plt.style.use("fivethirtyeight")
```
Financial Institution Fraud Detection Analysis
Blog Post Inspiration and Objectives
In this blog post, I explore anomaly detection in Machine Learning by investigating bank fraud. It is really hard to find real-life examples of bank transactions, given the risk of compromising security protocols and user confidentiality agreements. Thus, after a great deal of searching, I found the following dataset from BankSim, an agent-based simulator of bank payments based on a sample of aggregated transactional data provided by a bank in Spain, which is commonly used in financial fraud-detection research due to its highly realistic nature. With that said, let’s try to analyze this topic with some Machine Learning:
Data Preprocessing - Cleaning and Analytics
First, we read and display the initial dataset for this blog post, downloaded from Kaggle into our file system. This dataset contains the information you would typically record for a financial transaction: the time step of the transaction, the IDs of the buyer and seller, the ZIP codes of the source and destination locations, the amount transferred, and whether each transaction was labeled as fraudulent or legitimate (not fraud).
```{python}
# Reading and displaying the initial dataset
df = pd.read_csv("datasets/bs140513_032310.csv")
df
```| step | customer | age | gender | zipcodeOri | merchant | zipMerchant | category | amount | fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 'C1093826151' | '4' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 4.55 | 0 |
| 1 | 0 | 'C352968107' | '2' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 39.68 | 0 |
| 2 | 0 | 'C2054744914' | '4' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 26.89 | 0 |
| 3 | 0 | 'C1760612790' | '3' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 17.25 | 0 |
| 4 | 0 | 'C757503768' | '5' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 35.72 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 594638 | 179 | 'C1753498738' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 20.53 | 0 |
| 594639 | 179 | 'C650108285' | '4' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 50.73 | 0 |
| 594640 | 179 | 'C123623130' | '2' | 'F' | '28007' | 'M349281107' | '28007' | 'es_fashion' | 22.44 | 0 |
| 594641 | 179 | 'C1499363341' | '5' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 14.46 | 0 |
| 594642 | 179 | 'C616528518' | '4' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 26.93 | 0 |
594643 rows × 10 columns
Because of the dataset’s size and the limited computing resources on my device, I will sample it down here so that the later DBSCAN run (a worst-case O(n^2) algorithm) is feasible.
```{python}
# Sampling due to computing resources on my device to perform DBSCAN (O(n^2) algorithm)
max_rows = 200000
df = df.sample(n=max_rows, random_state=42)
df
```| step | customer | age | gender | zipcodeOri | merchant | zipMerchant | category | amount | fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | 'C746862122' | '3' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 11.65 | 0 |
| 470791 | 146 | 'C1760492708' | '1' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 1.60 | 0 |
| 568310 | 172 | 'C1984083185' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 33.36 | 0 |
| 23709 | 9 | 'C1530262146' | '2' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 8.01 | 0 |
| 49723 | 19 | 'C1471216995' | '3' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 'C591563666' | '4' | 'M' | '28007' | 'M85975013' | '28007' | 'es_food' | 14.83 | 0 |
| 570570 | 173 | 'C19820082' | '5' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 14.11 | 0 |
| 136543 | 48 | 'C612045723' | '5' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 43.26 | 0 |
| 203448 | 69 | 'C1776310825' | '4' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 7.59 | 0 |
| 577977 | 175 | 'C1267749701' | '2' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 36.30 | 0 |
200000 rows × 10 columns
For clarity on the constraints and parameters of the working dataset, I gathered some high-level exploratory statistics: its shape, information about its entries, and so on.
```{python}
# Determining the shape of the initial dataset
df.shape
```(200000, 10)
```{python}
# Getting a sample of the initial dataset by viewing its first 10 entries
# in full
df.head(10)
```| step | customer | age | gender | zipcodeOri | merchant | zipMerchant | category | amount | fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | 'C746862122' | '3' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 11.65 | 0 |
| 470791 | 146 | 'C1760492708' | '1' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 1.60 | 0 |
| 568310 | 172 | 'C1984083185' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 33.36 | 0 |
| 23709 | 9 | 'C1530262146' | '2' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 8.01 | 0 |
| 49723 | 19 | 'C1471216995' | '3' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 38.11 | 0 |
| 96308 | 35 | 'C109617474' | '2' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 66.43 | 0 |
| 292707 | 96 | 'C1376617526' | '3' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 10.24 | 0 |
| 241145 | 81 | 'C1092512638' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 24.52 | 0 |
| 85866 | 31 | 'C1560676680' | '2' | 'F' | '28007' | 'M980657600' | '28007' | 'es_sportsandtoys' | 27.80 | 1 |
| 558273 | 170 | 'C121713128' | '2' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 66.94 | 0 |
```{python}
# Figuring out all of the columns (and their names) available for me to use in
# the dataset
df.columns
```Index(['step', 'customer', 'age', 'gender', 'zipcodeOri', 'merchant',
'zipMerchant', 'category', 'amount', 'fraud'],
dtype='object')
```{python}
# Getting basic information about the dataset
df.info()
```<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, 70803 to 577977
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 step 200000 non-null int64
1 customer 200000 non-null object
2 age 200000 non-null object
3 gender 200000 non-null object
4 zipcodeOri 200000 non-null object
5 merchant 200000 non-null object
6 zipMerchant 200000 non-null object
7 category 200000 non-null object
8 amount 200000 non-null float64
9 fraud 200000 non-null int64
dtypes: float64(1), int64(2), object(7)
memory usage: 16.8+ MB
```{python}
# Figuring out the number of duplicated elements in the dataset (could be
# problematic if not resolved)
df.duplicated().sum()
```0
Additionally, before handing my Financial-Institution Fraud dataset over for Machine Learning training and prediction, I need to clean the data prior to the analysis stage: removing duplicates, deleting null/NaN values, fixing types of columns, filling invalid values with suitable alternatives, etc.
```{python}
# Renaming the columns to be more readable
df = df.rename(columns={"zipcodeOri": "ZipCodeOrig", "step": "TimeStep"})
cols_rename_dict = {}
for col in df.columns:
cols_rename_dict.update({col: str(col[0].upper() + col[1:])})
df = df.rename(columns=cols_rename_dict)
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | 'C746862122' | '3' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 11.65 | 0 |
| 470791 | 146 | 'C1760492708' | '1' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 1.60 | 0 |
| 568310 | 172 | 'C1984083185' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 33.36 | 0 |
| 23709 | 9 | 'C1530262146' | '2' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 8.01 | 0 |
| 49723 | 19 | 'C1471216995' | '3' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 'C591563666' | '4' | 'M' | '28007' | 'M85975013' | '28007' | 'es_food' | 14.83 | 0 |
| 570570 | 173 | 'C19820082' | '5' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 14.11 | 0 |
| 136543 | 48 | 'C612045723' | '5' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 43.26 | 0 |
| 203448 | 69 | 'C1776310825' | '4' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 7.59 | 0 |
| 577977 | 175 | 'C1267749701' | '2' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 36.30 | 0 |
200000 rows × 10 columns
```{python}
# Figuring out the number of 'null'/'NaN' elements in the dataset (i.e. if NaN
# filling is needed or not)
print(df.isnull().sum())
(df.isnull().sum() / df.shape[0]) * 100
```TimeStep 0
Customer 0
Age 0
Gender 0
ZipCodeOrig 0
Merchant 0
ZipMerchant 0
Category 0
Amount 0
Fraud 0
dtype: int64
TimeStep 0.0
Customer 0.0
Age 0.0
Gender 0.0
ZipCodeOrig 0.0
Merchant 0.0
ZipMerchant 0.0
Category 0.0
Amount 0.0
Fraud 0.0
dtype: float64
```{python}
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | 'C746862122' | '3' | 'M' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 11.65 | 0 |
| 470791 | 146 | 'C1760492708' | '1' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 1.60 | 0 |
| 568310 | 172 | 'C1984083185' | '3' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 33.36 | 0 |
| 23709 | 9 | 'C1530262146' | '2' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 8.01 | 0 |
| 49723 | 19 | 'C1471216995' | '3' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 'C591563666' | '4' | 'M' | '28007' | 'M85975013' | '28007' | 'es_food' | 14.83 | 0 |
| 570570 | 173 | 'C19820082' | '5' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 14.11 | 0 |
| 136543 | 48 | 'C612045723' | '5' | 'F' | '28007' | 'M348934600' | '28007' | 'es_transportation' | 43.26 | 0 |
| 203448 | 69 | 'C1776310825' | '4' | 'M' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 7.59 | 0 |
| 577977 | 175 | 'C1267749701' | '2' | 'F' | '28007' | 'M1823072687' | '28007' | 'es_transportation' | 36.30 | 0 |
200000 rows × 10 columns
```{python}
# Remove the single quotation characters around the following columns' entries
single_quotation_cols: list[str] = ["Customer", "Age", "Gender", "ZipCodeOrig", "Merchant", "ZipMerchant", "Category"]
for col in single_quotation_cols:
df[col] = df[col].str.strip("'")
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | C746862122 | 3 | M | 28007 | M348934600 | 28007 | es_transportation | 11.65 | 0 |
| 470791 | 146 | C1760492708 | 1 | F | 28007 | M1823072687 | 28007 | es_transportation | 1.60 | 0 |
| 568310 | 172 | C1984083185 | 3 | F | 28007 | M1823072687 | 28007 | es_transportation | 33.36 | 0 |
| 23709 | 9 | C1530262146 | 2 | M | 28007 | M1823072687 | 28007 | es_transportation | 8.01 | 0 |
| 49723 | 19 | C1471216995 | 3 | F | 28007 | M348934600 | 28007 | es_transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | C591563666 | 4 | M | 28007 | M85975013 | 28007 | es_food | 14.83 | 0 |
| 570570 | 173 | C19820082 | 5 | F | 28007 | M1823072687 | 28007 | es_transportation | 14.11 | 0 |
| 136543 | 48 | C612045723 | 5 | F | 28007 | M348934600 | 28007 | es_transportation | 43.26 | 0 |
| 203448 | 69 | C1776310825 | 4 | M | 28007 | M1823072687 | 28007 | es_transportation | 7.59 | 0 |
| 577977 | 175 | C1267749701 | 2 | F | 28007 | M1823072687 | 28007 | es_transportation | 36.30 | 0 |
200000 rows × 10 columns
```{python}
# Filling the "Age" category entries with more appropriate numerical ones
df["Age"] = df["Age"].map(lambda entry: entry if entry != "U" else "-1")
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | C746862122 | 3 | M | 28007 | M348934600 | 28007 | es_transportation | 11.65 | 0 |
| 470791 | 146 | C1760492708 | 1 | F | 28007 | M1823072687 | 28007 | es_transportation | 1.60 | 0 |
| 568310 | 172 | C1984083185 | 3 | F | 28007 | M1823072687 | 28007 | es_transportation | 33.36 | 0 |
| 23709 | 9 | C1530262146 | 2 | M | 28007 | M1823072687 | 28007 | es_transportation | 8.01 | 0 |
| 49723 | 19 | C1471216995 | 3 | F | 28007 | M348934600 | 28007 | es_transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | C591563666 | 4 | M | 28007 | M85975013 | 28007 | es_food | 14.83 | 0 |
| 570570 | 173 | C19820082 | 5 | F | 28007 | M1823072687 | 28007 | es_transportation | 14.11 | 0 |
| 136543 | 48 | C612045723 | 5 | F | 28007 | M348934600 | 28007 | es_transportation | 43.26 | 0 |
| 203448 | 69 | C1776310825 | 4 | M | 28007 | M1823072687 | 28007 | es_transportation | 7.59 | 0 |
| 577977 | 175 | C1267749701 | 2 | F | 28007 | M1823072687 | 28007 | es_transportation | 36.30 | 0 |
200000 rows × 10 columns
```{python}
df["Category"].value_counts()
```Category
es_transportation 169900
es_food 8905
es_health 5384
es_wellnessandbeauty 5100
es_fashion 2188
es_barsandrestaurants 2108
es_hyper 2092
es_sportsandtoys 1306
es_tech 794
es_home 647
es_hotelservices 568
es_otherservices 317
es_contents 286
es_travel 240
es_leisure 165
Name: count, dtype: int64
```{python}
# Removing unnecessary "es_" prefixes from the "Category" column entries
df["Category"] = df["Category"].map(lambda entry: entry.replace("es_", ""))
# Rename "Category" column entries to be more readable
df["Category"].replace({"barsandrestaurants": "bars_and_restaurants",
"hotelservices": "hotel_services", "otherservices": "other_services",
"sportsandtoys": "sports_and_toys", "wellnessandbeauty": "wellness_and_beauty"}, inplace=True)
# Fix the capitalization on the entries in the "Category" column for readability
def capitalize_first_letter(entry: str):
word_entries = entry.split("_")
word_entries = [(word[0].upper() + word[1:]) for word in word_entries]
return "_".join(word_entries)
df["Category"] = df["Category"].apply(capitalize_first_letter)
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | C746862122 | 3 | M | 28007 | M348934600 | 28007 | Transportation | 11.65 | 0 |
| 470791 | 146 | C1760492708 | 1 | F | 28007 | M1823072687 | 28007 | Transportation | 1.60 | 0 |
| 568310 | 172 | C1984083185 | 3 | F | 28007 | M1823072687 | 28007 | Transportation | 33.36 | 0 |
| 23709 | 9 | C1530262146 | 2 | M | 28007 | M1823072687 | 28007 | Transportation | 8.01 | 0 |
| 49723 | 19 | C1471216995 | 3 | F | 28007 | M348934600 | 28007 | Transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | C591563666 | 4 | M | 28007 | M85975013 | 28007 | Food | 14.83 | 0 |
| 570570 | 173 | C19820082 | 5 | F | 28007 | M1823072687 | 28007 | Transportation | 14.11 | 0 |
| 136543 | 48 | C612045723 | 5 | F | 28007 | M348934600 | 28007 | Transportation | 43.26 | 0 |
| 203448 | 69 | C1776310825 | 4 | M | 28007 | M1823072687 | 28007 | Transportation | 7.59 | 0 |
| 577977 | 175 | C1267749701 | 2 | F | 28007 | M1823072687 | 28007 | Transportation | 36.30 | 0 |
200000 rows × 10 columns
```{python}
df["Category"].value_counts()
```Category
Transportation 169900
Food 8905
Health 5384
Wellness_And_Beauty 5100
Fashion 2188
Bars_And_Restaurants 2108
Hyper 2092
Sports_And_Toys 1306
Tech 794
Home 647
Hotel_Services 568
Other_Services 317
Contents 286
Travel 240
Leisure 165
Name: count, dtype: int64
```{python}
# Rename "Gender" column entries to be more readable
df["Gender"] = df["Gender"].map({"M": "Male", "F": "Female", "E": "Enterprise", "U": "Unknown"})
df
```| TimeStep | Customer | Age | Gender | ZipCodeOrig | Merchant | ZipMerchant | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|---|---|---|---|
| 70803 | 26 | C746862122 | 3 | Male | 28007 | M348934600 | 28007 | Transportation | 11.65 | 0 |
| 470791 | 146 | C1760492708 | 1 | Female | 28007 | M1823072687 | 28007 | Transportation | 1.60 | 0 |
| 568310 | 172 | C1984083185 | 3 | Female | 28007 | M1823072687 | 28007 | Transportation | 33.36 | 0 |
| 23709 | 9 | C1530262146 | 2 | Male | 28007 | M1823072687 | 28007 | Transportation | 8.01 | 0 |
| 49723 | 19 | C1471216995 | 3 | Female | 28007 | M348934600 | 28007 | Transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | C591563666 | 4 | Male | 28007 | M85975013 | 28007 | Food | 14.83 | 0 |
| 570570 | 173 | C19820082 | 5 | Female | 28007 | M1823072687 | 28007 | Transportation | 14.11 | 0 |
| 136543 | 48 | C612045723 | 5 | Female | 28007 | M348934600 | 28007 | Transportation | 43.26 | 0 |
| 203448 | 69 | C1776310825 | 4 | Male | 28007 | M1823072687 | 28007 | Transportation | 7.59 | 0 |
| 577977 | 175 | C1267749701 | 2 | Female | 28007 | M1823072687 | 28007 | Transportation | 36.30 | 0 |
200000 rows × 10 columns
```{python}
# Convert all numerical columns to the correct type in the dataframe
df["Age"] = df["Age"].astype("int64")
df["ZipCodeOrig"] = df["ZipCodeOrig"].astype("int64")
df["ZipMerchant"] = df["ZipMerchant"].astype("int64")
df.info()
```<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, 70803 to 577977
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TimeStep 200000 non-null int64
1 Customer 200000 non-null object
2 Age 200000 non-null int64
3 Gender 200000 non-null object
4 ZipCodeOrig 200000 non-null int64
5 Merchant 200000 non-null object
6 ZipMerchant 200000 non-null int64
7 Category 200000 non-null object
8 Amount 200000 non-null float64
9 Fraud 200000 non-null int64
dtypes: float64(1), int64(5), object(4)
memory usage: 16.8+ MB
Here, I offer some visualizations of the cleaned dataset before passing it on for Machine Learning training and prediction. First, I visualize the counts of the discrete values within each descriptive column (Category and Gender) as bar graphs to show their spread.
```{python}
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.barplot(x=df["Category"].unique(), y=df["Category"].value_counts(), palette=sns.color_palette("husl", 8))
plt.xlabel("Transaction Type")
plt.ylabel("Transaction Type Count")
plt.title("Different Types of Category Transactions in this Fin. Inst. Fraud Detection Analysis Dataset")
ax = plt.subplot()
ax.set_xticklabels(list(df["Category"].unique()),
rotation=30,
fontsize="8",
horizontalalignment="right")
plt.show()
df["Category"].value_counts()
```
Category
Transportation 169900
Food 8905
Health 5384
Wellness_And_Beauty 5100
Fashion 2188
Bars_And_Restaurants 2108
Hyper 2092
Sports_And_Toys 1306
Tech 794
Home 647
Hotel_Services 568
Other_Services 317
Contents 286
Travel 240
Leisure 165
Name: count, dtype: int64
```{python}
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.barplot(x=df["Gender"].unique(), y=df["Gender"].value_counts(), palette=sns.color_palette("husl", 8))
plt.xlabel("Transaction Type")
plt.ylabel("Transaction Type Count")
plt.title("Different Types of Entities in this Fin. Inst. Fraud Detection Analysis Dataset")
ax = plt.subplot()
ax.set_xticklabels(list(df["Gender"].unique()),
rotation=30,
fontsize="8",
horizontalalignment="right")
plt.show()
df["Gender"].value_counts()
```
Gender
Female 109264
Male 90181
Enterprise 386
Unknown 169
Name: count, dtype: int64
```{python}
# Set the display precision for floating-point numbers to 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
df.loc[:, ["TimeStep", "Amount"]].describe()
```| TimeStep | Amount | |
|---|---|---|
| count | 200000.00 | 200000.00 |
| mean | 95.07 | 37.76 |
| std | 51.02 | 109.52 |
| min | 0.00 | 0.00 |
| 25% | 52.00 | 13.77 |
| 50% | 97.00 | 26.94 |
| 75% | 139.00 | 42.52 |
| max | 179.00 | 7403.44 |
```{python}
# TimeStep = the simulation's unit of time; BankSim ran for 180 steps
# (roughly six months of simulated activity), so one step is about one day
df["TimeStep"].value_counts()
```TimeStep
175 1318
178 1304
177 1297
151 1286
163 1277
...
6 837
7 828
3 818
1 787
2 781
Name: count, Length: 180, dtype: int64
Additionally, I visualize the distributions of the continuous columns (TimeStep and Amount) as count plots to show their spread across the dataset.
```{python}
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
plt.title('Distribution of Time Feature')
sns.countplot(x="TimeStep", data=df)
```<Axes: title={'center': 'Distribution of Time Feature'}, xlabel='TimeStep', ylabel='count'>
```{python}
df["Amount"].value_counts()
```Amount
13.32 61
23.35 57
7.18 57
0.49 57
14.88 57
..
127.28 1
105.20 1
201.68 1
5736.56 1
89.20 1
Name: count, Length: 15276, dtype: int64
```{python}
amount_counts = df["Amount"].value_counts()
amount_filter_value: float = 250.00
df_amount_dist_display_sample = df[df["Amount"] <= amount_filter_value]
num_entries_above_amount_filter_value = df[df["Amount"] > amount_filter_value].shape[0]
perc_amount_above_amount_filter_value = (float(num_entries_above_amount_filter_value) / df.shape[0]) * 100
print(f"Note: The number of entries above the filtered amount value of {int(amount_filter_value)} is {num_entries_above_amount_filter_value} \
({perc_amount_above_amount_filter_value:.2f}% of total entries).")
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
plt.title('Distribution of Amount Feature')
sns.countplot(x="Amount", data=df_amount_dist_display_sample, color="red")
plt.show()
```
Note: The number of entries above the filtered amount value of 250 is 1828 (0.91% of total entries).
```{python}
counts_fraud_col = df["Fraud"].value_counts()
normal_cases, fraud_cases = counts_fraud_col[0], counts_fraud_col[1]
percent_normal = (normal_cases / (normal_cases + fraud_cases)) * 100
percent_fraud = (fraud_cases / (normal_cases + fraud_cases)) * 100
results = f"There were {normal_cases} non-fraudulent transactions ({percent_normal:.3f}%) \
and {fraud_cases} fradulent transactions ({percent_fraud:.3f}%)"
results
```
'There were 197641 non-fraudulent transactions (98.820%) and 2359 fraudulent transactions (1.179%)'
Lastly, I wanted to visualize the counts of all of the different types of entries within the Fraud column as a bar graph to show the spread across the dataset.
```{python}
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.barplot(x=counts_fraud_col.index, y=counts_fraud_col.values,
            hue=counts_fraud_col.index, legend=False,
            palette=sns.color_palette("husl", len(counts_fraud_col)))
plt.title("Comparison of the Number of Fraudulent vs. Non-Fraudulent Transactions")
plt.ylabel("Count")
plt.xlabel("Class: (0 - Non-Fraudulent vs. 1 - Fraudulent)")
plt.show()
```
Here are some other notable data analytics I did with the dataset: (1) the mean values across each spending category, and (2) the percentage of Fraud vs. Non-Fraud cases within each spending category (to check whether any indirect correlation might undermine or support my later Machine Learning findings).
```{python}
# Printing out the mean values across each category
print("Mean Feature Values per Category")
df_mean = df.groupby("Category")[["Amount", "Fraud"]].mean().reindex(df["Category"].unique())
df_mean.index.name = "Category"
df_mean
```
Mean Feature Values per Category
| Amount | Fraud | |
|---|---|---|
| Category | ||
| Transportation | 27.00 | 0.00 |
| Sports_And_Toys | 218.29 | 0.50 |
| Bars_And_Restaurants | 42.44 | 0.02 |
| Wellness_And_Beauty | 64.80 | 0.04 |
| Tech | 121.94 | 0.07 |
| Fashion | 63.70 | 0.02 |
| Home | 166.42 | 0.15 |
| Food | 36.68 | 0.00 |
| Health | 137.75 | 0.11 |
| Hyper | 45.55 | 0.04 |
| Travel | 2231.48 | 0.82 |
| Hotel_Services | 203.39 | 0.30 |
| Contents | 47.10 | 0.00 |
| Other_Services | 130.58 | 0.24 |
| Leisure | 278.32 | 0.92 |
```{python}
# Figuring out and displaying the differences between Fraud vs. Non-Fraud Cases
# in each spending category
df_non_fraud = df[df["Fraud"] == 0]
df_fraud = df[df["Fraud"] == 1]
pd.concat([df_fraud.groupby("Category")["Amount"].mean(), df_non_fraud.groupby("Category")["Amount"].mean(), \
           df.groupby("Category")["Fraud"].mean() * 100], keys=["Fraudulent", "Non-Fraudulent", "Percentage (%)"], axis=1, \
           sort=False).sort_values(by=["Non-Fraudulent"])
```
| Fraudulent | Non-Fraudulent | Percentage (%) | |
|---|---|---|---|
| Category | |||
| Transportation | NaN | 27.00 | 0.00 |
| Food | NaN | 36.68 | 0.00 |
| Hyper | 173.07 | 39.68 | 4.40 |
| Bars_And_Restaurants | 120.29 | 41.20 | 1.57 |
| Contents | NaN | 47.10 | 0.00 |
| Wellness_And_Beauty | 228.53 | 57.35 | 4.35 |
| Fashion | 229.22 | 61.09 | 1.55 |
| Other_Services | 311.61 | 73.49 | 23.97 |
| Leisure | 295.65 | 75.69 | 92.12 |
| Sports_And_Toys | 349.86 | 88.33 | 49.69 |
| Tech | 410.35 | 100.47 | 6.93 |
| Health | 411.24 | 104.61 | 10.81 |
| Hotel_Services | 426.41 | 105.72 | 30.46 |
| Home | 491.96 | 111.08 | 14.53 |
| Travel | 2582.51 | 623.28 | 82.08 |
```{python}
# Plot histograms of the amounts in fraud and non-fraud data
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
hist_bins: int = 100
plt.hist(df_fraud["Amount"], alpha=0.5, label="Fraud", bins=hist_bins)
plt.hist(df_non_fraud["Amount"], alpha=0.5, label="Non-Fraud", bins=hist_bins)
plt.title("Histogram Comparing the Distribution of Fraud vs. Non-Fraud Payments")
plt.xlabel("Amount")
plt.ylabel("Count")
plt.xlim(0, 1000)
plt.ylim(0, 10000)
plt.legend()
plt.show()
```
Just before moving on to the Machine Learning section of this report, I clean up a few last things in the dataset: removing unnecessary columns from the current dataframe, rounding numerical columns for readability, and factorizing (integer-encoding) the categorical columns so they can be fed to the Machine Learning models as numerical factors.
```{python}
# Removed unnecessary columns not needed for Machine Learning analysis
df.drop(labels=["Customer", "ZipCodeOrig", "Merchant", "ZipMerchant"], axis=1, inplace=True)
df
```| TimeStep | Age | Gender | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|
| 70803 | 26 | 3 | Male | Transportation | 11.65 | 0 |
| 470791 | 146 | 1 | Female | Transportation | 1.60 | 0 |
| 568310 | 172 | 3 | Female | Transportation | 33.36 | 0 |
| 23709 | 9 | 2 | Male | Transportation | 8.01 | 0 |
| 49723 | 19 | 3 | Female | Transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 4 | Male | Food | 14.83 | 0 |
| 570570 | 173 | 5 | Female | Transportation | 14.11 | 0 |
| 136543 | 48 | 5 | Female | Transportation | 43.26 | 0 |
| 203448 | 69 | 4 | Male | Transportation | 7.59 | 0 |
| 577977 | 175 | 2 | Female | Transportation | 36.30 | 0 |
200000 rows × 6 columns
```{python}
# Round the Amount column appropriately (for readability)
df["Amount"] = df["Amount"].round(2)
df
```| TimeStep | Age | Gender | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|
| 70803 | 26 | 3 | Male | Transportation | 11.65 | 0 |
| 470791 | 146 | 1 | Female | Transportation | 1.60 | 0 |
| 568310 | 172 | 3 | Female | Transportation | 33.36 | 0 |
| 23709 | 9 | 2 | Male | Transportation | 8.01 | 0 |
| 49723 | 19 | 3 | Female | Transportation | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 4 | Male | Food | 14.83 | 0 |
| 570570 | 173 | 5 | Female | Transportation | 14.11 | 0 |
| 136543 | 48 | 5 | Female | Transportation | 43.26 | 0 |
| 203448 | 69 | 4 | Male | Transportation | 7.59 | 0 |
| 577977 | 175 | 2 | Female | Transportation | 36.30 | 0 |
200000 rows × 6 columns
```{python}
df["Category"].value_counts()
```Category
Transportation 169900
Food 8905
Health 5384
Wellness_And_Beauty 5100
Fashion 2188
Bars_And_Restaurants 2108
Hyper 2092
Sports_And_Toys 1306
Tech 794
Home 647
Hotel_Services 568
Other_Services 317
Contents 286
Travel 240
Leisure 165
Name: count, dtype: int64
```{python}
# Convert any needed categorical columns into numerical ones via factorizing (integer mapping)
for col in df.columns:
if not pd.api.types.is_numeric_dtype(df[col]):
df[col] = pd.factorize(df[col])[0]
```
```{python}
df
```| TimeStep | Age | Gender | Category | Amount | Fraud | |
|---|---|---|---|---|---|---|
| 70803 | 26 | 3 | 0 | 0 | 11.65 | 0 |
| 470791 | 146 | 1 | 1 | 0 | 1.60 | 0 |
| 568310 | 172 | 3 | 1 | 0 | 33.36 | 0 |
| 23709 | 9 | 2 | 0 | 0 | 8.01 | 0 |
| 49723 | 19 | 3 | 1 | 0 | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 363474 | 116 | 4 | 0 | 7 | 14.83 | 0 |
| 570570 | 173 | 5 | 1 | 0 | 14.11 | 0 |
| 136543 | 48 | 5 | 1 | 0 | 43.26 | 0 |
| 203448 | 69 | 4 | 0 | 0 | 7.59 | 0 |
| 577977 | 175 | 2 | 1 | 0 | 36.30 | 0 |
200000 rows × 6 columns
```{python}
df.info()
```<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, 70803 to 577977
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TimeStep 200000 non-null int64
1 Age 200000 non-null int64
2 Gender 200000 non-null int64
3 Category 200000 non-null int64
4 Amount 200000 non-null float64
5 Fraud 200000 non-null int64
dtypes: float64(1), int64(5)
memory usage: 10.7 MB
As all of the data in the working dataframe is now numerical, I wanted a preliminary look at the quantifiable correlation between the Fraud column and the rest of the dataset. Using the dataframe corr function, I created a heatmap and a bar graph visualizing how each transaction attribute relates to the Fraud value.
```{python}
# Correlation heatmap to quantify relationships between financial transaction attributes
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.heatmap(df.corr(), annot=True, linewidths=0.5)
plt.title("Correlation Heatmap Between All Financial Institution Quantiative Factors (2005-2015)")
plt.show()
# Correlation bar graph between Fraud and all other financial transaction attributes
target_corr = df.corr()["Fraud"].abs().sort_values(ascending=False)
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.barplot(x=target_corr.index[1:], y=target_corr.values[1:],
            hue=target_corr.index[1:], legend=False,
            palette=sns.color_palette("husl", len(target_corr) - 1))
plt.xticks(rotation=45, ha="right")
plt.xlabel("Financial Transaction Features")
plt.ylabel("Correlation with Fraud")
plt.title("Correlation between Fraud and Other Features When Comparing Across All Financial Transactions in the Dataset (2005-2015)")
plt.tight_layout()
plt.show()
```
```{python}
# Drop the TimeStep column, which will not be used as a clustering feature
df = df.drop(["TimeStep"], axis=1)
df
```| Age | Gender | Category | Amount | Fraud | |
|---|---|---|---|---|---|
| 70803 | 3 | 0 | 0 | 11.65 | 0 |
| 470791 | 1 | 1 | 0 | 1.60 | 0 |
| 568310 | 3 | 1 | 0 | 33.36 | 0 |
| 23709 | 2 | 0 | 0 | 8.01 | 0 |
| 49723 | 3 | 1 | 0 | 38.11 | 0 |
| ... | ... | ... | ... | ... | ... |
| 363474 | 4 | 0 | 7 | 14.83 | 0 |
| 570570 | 5 | 1 | 0 | 14.11 | 0 |
| 136543 | 5 | 1 | 0 | 43.26 | 0 |
| 203448 | 4 | 0 | 0 | 7.59 | 0 |
| 577977 | 2 | 1 | 0 | 36.30 | 0 |
200000 rows × 5 columns
Machine Learning - Model Training and Evaluation
Great, now we are onto the Machine Learning part of the blog post!
Since the dataframe is now properly cleaned and integer-encoded, I split it into training and test datasets for the Machine Learning models, with 80% going to the training dataset and the remaining 20% to the test dataset. Because the sequential order of the data does not matter here, I used the train_test_split function with shuffling and randomization, which helps keep the resulting models objective in their returned results.
Note that here I also used a Pipeline object from the scikit-learn package along with the StandardScaler class. StandardScaler applies a Z-score normalization to the data, which matters because many Machine Learning algorithms are sensitive to the scale of the features they are trained on. As I learned online, wrapping this in a Pipeline helps ensure the preprocessing is applied consistently just before the dataset is passed to the Machine Learning model for training and later evaluation.
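To make the scaling step concrete, here is a minimal sketch (using a tiny made-up array rather than the BankSim data, and reusing the imports from the top of the post) showing that StandardScaler applies the per-column Z-score transformation z = (x - mean) / std:

```{python}
# Minimal sketch: StandardScaler is equivalent to the per-column Z-score
# transformation z = (x - mean) / std. The toy array below is for illustration only.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = StandardScaler().fit_transform(toy)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
print(np.allclose(scaled, manual))  # True
```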
```{python}
X = df.drop(["Fraud"], axis=1)
y = df["Fraud"]
print("X Shape:", X.shape)
print("Y Shape:", y.shape)
pipeline = Pipeline([
("std_scaler", StandardScaler())
])
X_scaled = pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, random_state=42)
```X Shape: (200000, 4)
Y Shape: (200000,)
To bring the feature space down to something we can visualize, I used PCA to reduce the X columns to a small number of components (two principal components is the standard choice for a 2-D plot).
```{python}
# Used PCA to lower the dimensionality of the dataset used later for Machine
# Learning (training and testing)
pca = PCA()
principal_components = pca.fit_transform(X_test)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title("Explained Variance by Principal Components")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
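# Quick numeric check: variance captured by the first two principal components
print(f"First two principal components explain {pca.explained_variance_ratio_[:2].sum():.1%} of the variance.")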
```
```{python}
# Created a new dataframe containing the 2 main principal components used in the
# later Machine Learning section of this blog-post
pca_df = pd.DataFrame(data=principal_components[:, :2], columns=["PC1", "PC2"])
pca_df
```| PC1 | PC2 | |
|---|---|---|
| 0 | -0.60 | -0.69 |
| 1 | -0.27 | 1.19 |
| 2 | -0.39 | 0.14 |
| 3 | -0.17 | 1.72 |
| 4 | -0.20 | -0.20 |
| ... | ... | ... |
| 39995 | -0.22 | -0.20 |
| 39996 | -0.56 | -0.69 |
| 39997 | -0.58 | -0.69 |
| 39998 | 6.55 | -0.24 |
| 39999 | -0.60 | -0.69 |
40000 rows × 2 columns
```{python}
y_test.value_counts()
```Fraud
0 39527
1 473
Name: count, dtype: int64
Adding the y column ("Fraud") to this modified dataframe containing the two new PCA components lets us experimentally look at some basic labeling of the data prior to clustering.
(Disclaimer: I realize that clustering is considered unsupervised Machine Learning. The labels here are used purely for reference, to anticipate the clustering patterns to come. In real clustering applications, I would not have labels available in the first place. As seen throughout this section, this practice is not intended to skew my Machine Learning methods.)
```{python}
# Add the "Fraud" column to the PCA DataFrame
pca_df["Fraud"] = y_test.values
# Separate the data based on the "Fraud" column
fraud_points = pca_df[pca_df["Fraud"] == 1]
not_fraud_points = pca_df[pca_df["Fraud"] == 0]
# Visualize "fraud" and "not fraud" PCA data
plt.scatter(not_fraud_points["PC1"], not_fraud_points["PC2"], c='blue', label='Not Fraud', alpha=0.5)
plt.scatter(fraud_points["PC1"], fraud_points["PC2"], c='red', label='Fraud', alpha=0.5)
plt.title("PCA Plot of Fraud vs. Not-Fraud Data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```
Leading into the DBSCAN analysis, we will run an exploratory K-Means clustering to see whether we can detect any patterns or correlations between Fraud and the other attributes reported with the existing financial transactions.
```{python}
# Using the KElbowVisualizer to determine the appropriate number of clusters
# to use in the K-Means algorithm
kmeans_model = KMeans(n_init=10)
kmeans_elbow_visualizer = KElbowVisualizer(kmeans_model, k=(1, 11))
kmeans_elbow_visualizer.fit(X_train)
kmeans_elbow_visualizer.show()
```
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
```{python}
# Complete K-Means Clustering to find any clustering relationships in dataset
# Run the K-Means model on scaled data
kmeans_model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)
# Get the cluster number for each datapoint
X_clusters = kmeans_model.predict(X_test)
# Save the cluster centroids
X_clusters_centers = kmeans_model.cluster_centers_
# Plot PCA and K-Means cluster centers
X_test_principal_clusters_centers = pca.transform(X_clusters_centers)
plt.scatter(principal_components[:, 0], principal_components[:, 1], c=X_clusters, cmap='viridis', s=50, alpha=0.5)
plt.scatter(X_test_principal_clusters_centers[:, 0], X_test_principal_clusters_centers[:, 1], c='red', marker='X', s=200, label='Cluster Centers')
plt.title("Explained Variance by Principal Components")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
```
```{python}
# Obtain predictions: compute each test point's Euclidean distance to its assigned
# cluster centroid, then flag the farthest 5% of points (95th percentile) as anomalies
kmeans_dist = np.array([np.linalg.norm(point - center)
                        for point, center in zip(X_test, X_clusters_centers[X_clusters])])
threshold = np.percentile(kmeans_dist, 95)
y_pred = np.where(kmeans_dist >= threshold, 1.0, 0.0)
y_pred
```array([0., 0., 0., ..., 0., 1., 0.])
```{python}
# Display the accuracy statistics and the confusion matrix of the K-Means algorithm
# cluster predictions
clf_report = pd.DataFrame(classification_report(y_true=y_test, y_pred=y_pred, output_dict=True, zero_division=0))
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(f"ROC AUC Score: {roc_auc_score(y_true=y_test, y_score=y_pred) * 100:.2f}%")
print("_______________________________________________")
print(f"CLASSIFICATION REPORT:\n{clf_report}")
print("_______________________________________________")
print(f"Confusion Matrix:\n{conf_matrix}")
# Plot the confusion matrix
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()
```
ROC AUC Score: 81.49%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 0.16 0.95 0.58 0.99
recall 0.96 0.67 0.95 0.81 0.95
f1-score 0.98 0.26 0.95 0.62 0.97
support 39527.00 473.00 0.95 40000.00 40000.00
_______________________________________________
Confusion Matrix:
[[37845 1682]
[ 155 318]]
```
Now to the main focus: the DBSCAN analysis. We use the NearestNeighbors algorithm, originally intended for neighbor searches in unsupervised settings, to find appropriate parameters for DBSCAN. DBSCAN is a good fit here because it computes all neighborhood queries in bulk and is effective at flagging outliers that fall outside the supposed clusters, which answers the original anomaly-detection question.
```{python}
# Methodical way to find an appropriate number for the DBSCAN `min_samples` argument
dbscan_min_samples = 2 * (X.shape[1])
dbscan_min_samples
```8
```{python}
# Completing NearestNeighbors algorithm to find the appropriate `eps` (epsilon value)
# from the y-coordinate of the inflection point (methodical way)
nn_model = NearestNeighbors(n_neighbors=dbscan_min_samples)
nn_model_fit = nn_model.fit(X_test)
nn_distances, nn_indices = nn_model_fit.kneighbors(X_test)
```
```{python}
# Plot the (sorted) NearestNeighbors distances between groups of points
nn_distances = np.sort(nn_distances, axis=0)
nn_distances = nn_distances[:, 1]
plt.figure(figsize=(10, 8))
plt.rcParams["font.size"] = 7
plt.plot(nn_distances)
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.title("Distances Between Neighbors for All Data Points")
plt.show()
```
```{python}
# Defining the inflection point (x, y) using the Kneed package
kneedle_model = kneed.KneeLocator(y=nn_distances, x=np.arange(0, X_test.shape[0]),
S=1.0, curve="convex", direction="increasing")
nn_model_inflection_point = [kneedle_model.knee, kneedle_model.knee_y]
nn_model_inflection_point
```[39816, 0.4188503437268385]
```{python}
# Visualizing the kneedle model
kneedle_model.plot_knee()
```
```{python}
df["Fraud"].value_counts()
```Fraud
0 197641
1 2359
Name: count, dtype: int64
```{python}
# Methodical way of getting the `eps` argument for DBSCAN
rounded_nn_model_inflection_point_y_coord = round(nn_model_inflection_point[1], 2)
rounded_nn_model_inflection_point_y_coord
```0.42
```{python}
# Apply PCA to reduce dimensionality
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Apply DBSCAN to the reduced-dimensional training data
dbscan = DBSCAN(eps=rounded_nn_model_inflection_point_y_coord, min_samples=dbscan_min_samples)
dbscan.fit(X_train_pca)
# Apply DBSCAN to the reduced-dimensional testing data (DBSCAN has no separate
# predict method, so fit_predict re-fits on the test set)
labels_test = dbscan.fit_predict(X_test_pca)
```
Finally, after setting up and running DBSCAN, we can visualize the "outlier" points of the dataset (i.e. the suspected Fraud points across all financial transactions). This illustrates the extent of the anomalies detected across the dataset. Even though DBSCAN marks a great deal of the points as outliers, we can sanity-check this against the K-Means analysis done above, which shows similar projections.
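As a quick numeric companion to the plot below, here is a small hedged sketch (reusing the labels_test and y_test variables defined earlier) that treats DBSCAN's noise label (-1) as a fraud prediction and scores it the same way the K-Means results were scored above:

```{python}
# Sketch: treat DBSCAN noise points (label -1) as predicted fraud and compare
# against the true Fraud labels from the held-out test set
dbscan_pred = (labels_test == -1).astype(int)
print(f"ROC AUC Score: {roc_auc_score(y_true=y_test, y_score=dbscan_pred) * 100:.2f}%")
print(confusion_matrix(y_true=y_test, y_pred=dbscan_pred))
```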
```{python}
# Visualize the DBSCAN results on the testing data
plt.figure(figsize=(10, 8))
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=labels_test, edgecolors='k', cmap='viridis')
# Highlight outliers (those labeled as -1 by DBSCAN)
outliers_mask_test = (labels_test == -1)
plt.scatter(X_test_pca[outliers_mask_test, 0], X_test_pca[outliers_mask_test, 1], c='red', edgecolors='k', s=100, label='Outliers')
plt.title('DBSCAN Clustering After PCA (Testing)')
plt.legend()
plt.show()
```
Conclusions
Looking at the final clustering results, I found the overall approach of using DBSCAN and K-Means clustering together to be mostly effective in separating fraudulent from legitimate financial institution (i.e. bank) transactions when compared against the held-out test data. The rationale for running the K-Means algorithm in this blog post was to use it as an approximate benchmark against which to judge the effectiveness of DBSCAN in detecting fraud via clustering. After visual inspection, I can justify my earlier claims and recommend this approach for classifying basic cases of financial institution fraud in future applications.

In terms of expanding the scope of this blog post, I would take this further by trying to determine which specific factors in the entries marked as fraud correlate most strongly with fraud-like behavior. That would help in building more secure, fraud-protected systems that guard against those factors.

Seeing how I ran into a great deal of issues with DBSCAN for outlier detection, due to the algorithmic complexity inside this Machine Learning technique, I will be careful about reaching for it whenever I need anomaly or outlier detection in the future, as many reputable resources note how inefficient it can be at scale. To be fair, the effectiveness of this unsupervised technique cannot be overstated, since it does not need any previous labeling to cluster appropriately. However, I will only use it with smaller datasets, and if I must use DBSCAN again I will perform careful hyperparameter tuning (balancing the eps and min_samples arguments; a minimal sketch of such a sweep follows below) and complete dimensionality reduction with PCA before calling DBSCAN, to ensure the best possible results (or at least results that do not time out due to compute and memory constraints on my device).

Ultimately, I learned a great deal from this blog post experience: I now better understand how to properly utilize DBSCAN and perform outlier and anomaly detection, an unsupervised branch of Machine Learning, to find relationships between groups of data and infer conclusions about potential trends, by applying it to a practical, everyday dilemma in our society. Even though real-world datasets are never this simple to analyze, I hope to utilize these techniques in future contexts, as this was an interesting case study given its real-life implications in modern financial security systems.
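To make that eps/min_samples tuning concrete, here is a minimal sketch of the kind of small sweep I have in mind (the grid values below are arbitrary placeholders, and it reuses the PCA-reduced test data from the DBSCAN section above):

```{python}
# Sketch: small grid sweep over DBSCAN hyperparameters on the PCA-reduced test
# data, reporting the number of clusters and the share of points flagged as outliers
from itertools import product

for eps, min_samples in product([0.3, 0.42, 0.6], [4, 8, 16]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_test_pca)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    outlier_rate = (labels == -1).mean()
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {outlier_rate:.2%} outliers")
```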
Reference Sources and Citations (IEEE Format)
To complete this blog post, I used the following online sources as references for developing this:
[1] BankSim Payments Simulator - Fraud Detection Dataset:
- E. Lopez-Rojas. “Synthetic Data From a Financial Payment System”, 2015. [Online]. Available: https://www.kaggle.com/datasets/ealaxi/banksim1. [Accessed 15-Nov.-2023].
[2] KMeans Elbow Reference:
- F. Javier Gallego, “Outliers+EDA+Clustering Tutorial”, Jul.-2022. [Online]. Available: https://www.kaggle.com/code/javigallego/outliers-eda-clustering-tutorial. [Accessed 16-Nov.-2023].
[3] KMeans Reference:
- M. Isbaine, “Fraud Detection”, 2021. [Online]. Available: https://www.kaggle.com/code/mohamedisbaine/fraud-detection. [Accessed 16-Nov.-2023].
[4] NearestNeighbor, KNeedle, DBSCAN Reference:
- R. De Menezes Gomes, “Fraud Detection - Clustering with DBSCAN”, Jan.-2023. [Online]. Available: https://www.kaggle.com/code/rodmnzs/fraud-detection-clustering-with-dbscan. [Accessed 15-Nov.-2023].
[5] DBSCAN Reference #2:
- D. Valeti, “DBSCAN Algorithm for Fraud Detection & Outlier Detection in a Data Set”, 18-Oct.-2021. [Online]. Available: https://medium.com/@dilip.voleti/dbscan-algorithm-for-fraud-detection-outlier-detection-in-a-data-set-60a10ad06ea8. [Accessed 15-Nov.-2023].
[6] Seaborn Color Palette Reference:
- Seaborn, “Choosing Color Palettes”. [Online]. Available: https://seaborn.pydata.org/tutorial/color_palettes.html. [Accessed 16-Nov.-2023].